METHOD AND SYSTEM FOR PITCH CONTOUR QUANTIZATION IN AUDIO CODING 

Cross References to Related Applications 

This application is related to U.S. patent application docket number 944-003.182, 
entitled "Method and System for Speech Coding", which is assigned to the assignee of this 
application and filed even date herewith. 

Field of the Invention 

The present invention relates generally to a speech coder and, more specifically, to 
a speech coder that allows a sufficiently long encoding delay. 

Background of the Invention 

It will become a requirement in the United States to take visually impaired persons into 
consideration when designing mobile phones. Manufacturers of mobile phones must offer 
phones with a user interface suitable for a visually impaired user. In practice, this means 
that the menus are "spoken aloud" in addition to being displayed on the screen. It is 
obviously beneficial to store these audible messages in as little memory as possible. 
Typically, text-to-speech (TTS) algorithms have been considered for this application. 
However, to achieve reasonable quality TTS output, enormous databases are needed and, 
therefore, TTS is not a convenient solution for mobile terminals. With low memory usage, 
the quality provided by current TTS algorithms is not acceptable. 

Besides TTS, a speech coder can be utilized to compress pre-recorded messages. 
This compressed information is saved and decoded in the mobile terminal to produce the 
output speech. For minimum memory consumption, very low bit rate coders would be 
desired. To generate the input speech signal to the coding system, either human speakers 
or high-quality (and high-complexity) TTS algorithms can be used. 

In a typical speech coder, the input speech signal is processed in fixed-length 
segments called frames. In current speech coders, the frame length is usually 10-30 ms, and 
a lookahead segment of around 5-15 ms from the subsequent frame may also be available. 
The frame may further be divided into a number of subframes. For every frame, the 
encoder determines a parametric representation of the input signal. The parameters are 
quantized, and transmitted through a communication channel or stored in a storage 
medium. At the receiving end, the decoder constructs a synthesized signal based on the 
received parameters, as shown in Figure 1. 

While one underlying goal of speech coding is to achieve the best possible quality 
at a given coding rate, other performance aspects also have to be considered in developing 
a speech coder for a certain application. In addition to speech quality and bit rate, the main 
attributes include coder delay (defined mainly by the frame size plus a possible 
lookahead), complexity and memory requirements of the coder, sensitivity to channel 
errors, robustness to acoustic background noise, and the bandwidth of the coded speech. 
Also, a speech coder should be able to efficiently reproduce input signals with different 
energy levels and frequency characteristics. 

Quantization of the pitch contour is a task that is required in almost all practical 
speech coders. The pitch parameter is related to the fundamental frequency of speech: 
during voiced speech, the pitch corresponds to the fundamental frequency and can be 
perceived as the pitch of speech. During purely unvoiced speech, there is no fundamental 
frequency in a physical sense and the concept of pitch is vague. In most speech coders, 
however, the "pitch information" is also needed during unvoiced speech. For example, in 
coders based on the well-known code excited linear prediction (CELP) approach, the long 
term prediction lag (roughly corresponding to pitch) is also transmitted during unvoiced 
portions of speech. 

In a typical speech coder, the pitch parameter is estimated from the signal at regular 
intervals. The pitch estimators used in speech coders can roughly be divided into the 
following categories: (i) pitch estimators utilizing the time domain properties of speech, 
(ii) pitch estimators utilizing the frequency domain properties of speech, and (iii) pitch 
estimators utilizing both the time and frequency domain properties of speech. 

The most common prior-art solution to the quantization of the pitch contour (pitch 
values estimated at regular intervals) is to use scalar quantization. Typically, a single 
quantizer is used for all pitch values and the transmission rate is held fixed. Alternative 
solutions have also been proposed. For example, every second pitch value can be 
quantized using a scalar quantizer and the values between these can be coded with a 
differential quantizer. In some of the existing encoders, the quantizer contains two 
modes, a memoryless mode and a predictive mode. These techniques offer some 
advantages when compared to the basic approach, but the redundancies are only partially 
exploited. 

The main drawback of the prior art is that the conventional quantization techniques 
with fixed update rates are inherently inefficient because there is a lot of redundancy in the 
pitch values transmitted. The fixed update rate used in the quantization of the pitch 
parameter is usually rather high (about 50 to 100 Hz) in order to be able to handle cases in 
which the pitch changes rapidly. However, rapid variations in the pitch contour are 
relatively rare. Consequently, a much lower update rate could be used most of the time. 

Summary of the Invention 

The present invention exploits the fact that a typical pitch contour evolves fairly 
smoothly but contains occasional rapid changes. Thus, it is possible to construct a 
piece-wise pitch contour that closely follows the shape of the original contour but contains 
less information to be coded. Instead of coding every pitch value of the pitch contour, only 
the points defining the piece-wise pitch contour where the derivative changes are quantized. 
During unvoiced speech, a constant default pitch value can be used both at the encoder 
and at the decoder. The segments on the piece-wise pitch contour can be linear or 
non-linear. 

Thus, according to the first aspect of the present invention, there is provided a 
method for improving coding efficiency in audio coding, wherein an audio signal is 
encoded for providing parameters indicative of the audio signal, the parameters including 
pitch contour data containing a plurality of pitch values representative of an audio segment 
in time. The method comprises the steps of: 

creating, based on the pitch contour data, a plurality of simplified pitch contour 
segment candidates, each candidate corresponding to a sub-segment of the audio signal; 

measuring deviation between each of the simplified pitch contour segment 
candidates and said pitch values in the corresponding sub-segment; 

selecting one of said candidates based on the measured deviations and one or more 
pre-selected criteria; and 

coding the pitch contour data in the sub-segment of the audio signal corresponding 
to the selected candidate with characteristics of the selected candidate. 

According to one embodiment of the present invention, the pitch contour data in 
the audio segment in time is approximated by a plurality of selected candidates, 
corresponding to a plurality of consecutive sub-segments in said audio segment, each of 
said plurality of selected candidates defined by a first end point and a second end point, 
and wherein said coding comprises the step of providing information indicative of the end 
points so as to allow the decoder to reconstruct the audio signal in the audio segment 
based on the information instead of the pitch contour data. The number of pitch values in 
some of the consecutive sub-segments is equal to or greater than 3. 

According to one embodiment of the present invention, the creating step is limited 
by a pre-selected condition such that the deviation between each of the simplified pitch 
contour segment candidates and each of said pitch values in the corresponding 
sub-segment is smaller than or equal to a pre-determined maximum value. 

According to one embodiment of the present invention, the created segment 
candidates have various lengths, and said selecting is based on the lengths of the segment 
candidates, and the pre-selected criteria include that the selected candidate has the 
maximum length among the segment candidates. 

According to one embodiment of the present invention, the selecting step is based 
on the lengths of the segment candidates, and the pre-selected criteria include that the 
measured deviation is minimum among a group of the candidates having the same length. 

According to one embodiment of the present invention, each of the simplified pitch 
contour segment candidates has a starting point and an end point, and said creating is 
carried out by adjusting the end point of the segment candidates. 

The audio signal comprises a speech signal. 

According to the second aspect of the present invention, there is provided a coding 
device for encoding an audio signal comprising pitch contour data containing a plurality of 
pitch values representative of an audio segment in time. The coding device comprises: 

an input end for receiving the pitch contour data; 

a data processing module, responsive to the pitch contour data, for creating a 
plurality of simplified pitch contour segment candidates, each candidate corresponding to 
a sub-segment of the audio signal, wherein the processing module comprises: 

an algorithm for measuring deviation between each of the simplified pitch 
contour segment candidates and said pitch values in the corresponding 
sub-segment; and 

an algorithm for selecting one of said candidates based on the measured 
deviations and pre-selected criteria; and 

a quantization module, responsive to the selected candidate, for coding the pitch 
contour data in the sub-segment of the audio signal corresponding to the selected 
candidate with characteristics of the selected candidate. 

According to one embodiment of the present invention, the quantization module 
provides audio data indicative of the coded pitch contour data in the sub-segment. The 
coding device further comprises 

a storage device, operatively connected to the quantization module to receive the 
audio data, for storing the audio data in a storage medium. 

According to another embodiment of the present invention, the coding device 
further comprises an output end, operatively connected to a storage medium, for providing 
the coded pitch contour data to the storage medium for storage. 

According to yet another embodiment of the present invention, the coding device 
further comprises an output end for transmitting the coded pitch contour data to the 
decoder so as to allow the decoder to reconstruct the audio signal also based on the coded 
pitch contour data. 

According to the third aspect of the present invention, there is provided a computer 
software product embodied in an electronically readable medium for use in conjunction 
with an audio coding device, the audio coding device providing parameters indicative of 
the audio signal, the parameters including pitch contour data containing a plurality of pitch 
20 values representative of an audio segment in time. The software product comprises: 

a code for creating a plurality of simplified pitch contour segment candidates based 
on the pitch contour data, each candidate corresponding to a sub-segment of the audio 
signal; 

a code for measuring deviation between each of the simplified pitch contour 
segment candidates and said pitch values in the corresponding sub-segment; and 

a code for selecting one of said candidates based on the measured deviations and 
pre-selected criteria, so as to allow a quantization module to code the pitch contour data in 
the sub-segment of the audio signal corresponding to the selected candidate with 
characteristics of the selected candidate. 

According to the fourth aspect of the present invention, there is provided a decoder 
for reconstructing an audio signal, wherein the audio signal is encoded for providing 
parameters indicative of the audio signal, the parameters including pitch contour data 
containing a plurality of pitch values representative of an audio segment in time, and 



PATENT 
944-003.191 

wherein the pitch contour data in the audio segment in time is approximated by a plurality 
of consecutive sub-segments in the audio segment, each of said sub-segments defined by a 
first end point and a second end point. The decoder comprises: 

an input for receiving audio data indicative of the end points defining the 
sub-segments; and 

a module for reconstructing the audio segment based on the received audio data. 

According to one embodiment of the present invention, the audio data is recorded 
on an electronic medium, and the input of the decoder is operatively connected to the 
electronic medium for receiving the audio data. 

According to another embodiment of the present invention, the audio data is 
transmitted through a communication channel, and the input of the decoder is operatively 
connected to the communication channel for receiving the audio data. 

According to the fifth aspect of the present invention, there is provided an 
electronic device, comprising: 

a decoder for reconstructing an audio signal, wherein the audio signal is encoded 
for providing parameters indicative of the audio signal, the parameters including pitch 
contour data containing a plurality of pitch values representative of an audio segment in 
time, and wherein the pitch contour data in the audio segment in time is approximated by a 
plurality of consecutive sub-segments in the audio segment, each of said sub-segments 
defined by a first end point and a second end point, so as to allow the audio segment to be 
constructed based on the end points defining the sub-segments; and 

an input for receiving audio data indicative of the end points and for providing the 
audio data to the decoder. 

According to one embodiment of the present invention, the audio data is recorded 
in an electronic medium, and the input is operatively connected to the electronic medium 
for receiving the audio data. 

According to another embodiment of the present invention, the audio data is 
transmitted through a communication channel, and the input is operatively connected to 
the communication channel for receiving the audio data. 

The electronic device can be a mobile terminal or a module for a terminal. 

According to the sixth aspect of the present invention, there is provided a 
communication network, comprising: 

a plurality of base stations; and 

a plurality of mobile stations communicating with the base stations, wherein at 
least one of the mobile stations comprises: 

a decoder for reconstructing an audio signal, wherein the audio signal is 
encoded for providing parameters indicative of the audio signal, the parameters 
including pitch contour data containing a plurality of pitch values representative of 
an audio segment in time, and wherein the pitch contour data in the audio segment 
in time is approximated by a plurality of consecutive sub-segments in the audio 
segment, each of said sub-segments defined by a first end point and a second end 
point, so as to allow the audio segment to be constructed based on the end points 
defining the sub-segments; and 

an input for receiving audio data indicative of the end points from at least one of 
the base stations for providing the audio data to the decoder. 

The present invention will become apparent upon reading the description taken in 
conjunction with Figures 2 to 6. 

Brief Description of the Drawings 

Figure 1 is a block diagram showing a prior art speech coding system. 

Figure 2 is an example of a piece-wise pitch contour according to one embodiment 
of the present invention. 

Figure 3 is a block diagram showing a speech coding system, according to one 
embodiment of the present invention. 

Figure 4 is a flowchart illustrating an example of an iteration process for 
generating a piece-wise pitch contour. 

Figure 5 is a flowchart illustrating an example of an iteration process for 
generating a piece-wise pitch contour based on an optimal simplified model. 

Figure 6 is a schematic representation showing a communication network capable 
of carrying out the present invention. 

Best Mode for Carrying Out the Invention 

With a piece-wise linear pitch contour, only those points of the contour where 
there are derivative changes are transmitted to the decoder. Accordingly, the update rate 
required for the pitch parameter is significantly reduced. In principle, the piece-wise 
linear contour is constructed in such a manner that the number of derivative changes is 
minimized while maintaining the deviation from the "true pitch contour" below a 
pre-specified limit. To obtain globally optimal results, the lookahead should be very long and 
the optimization would require large amounts of computation. However, very good results 
can be achieved with the very simple technique described in this section. The description 
is based on an implementation used in a speech coder designed for storage of pre-recorded 
audio messages. 

A simple but efficient optimization technique for constructing the piece-wise linear pitch 
contour can be obtained by going through the process one linear segment at a time. For 
each linear segment, the maximum length line (that can keep the deviation from the true 
contour low enough) is searched without using knowledge of the contour outside the 
boundaries of the linear segment. Within this optimization technique, there are two cases 
that have to be considered: the first linear segment and the other linear segments. 

The case of the first linear segment occurs at the beginning when the encoding 
process is started. In addition, if no pitch values are transmitted for inactive or unvoiced 
speech, the first segment after each such pause in the pitch transmission falls into this 
category. In both situations, both ends of the line can be optimized. Other cases fall into the 
second category, in which the starting point for the line has already been fixed and only the 
location of the end point can be optimized. 

In the case of the first linear segment, the process is started by selecting the first 
two pitch values as the best end points for the line found so far. Then, the actual iteration 
is started by considering the cases where the ends of the line are near the first and the third 
pitch values. The candidates for the starting point for the line are all the quantized pitch 
values that are close enough to the first original pitch value such that the criterion for the 
desired accuracy is satisfied. Similarly, the candidates for the end point are the quantized 
pitch values that are close enough to the third original pitch value. After the candidates 
have been found, all the possible start point and end point combinations are tried out: the 
accuracy of linear representation is measured at each original pitch location and the line 
can be accepted as a part of the piece-wise linear contour if the accuracy criterion is 
satisfied at all of these locations. Furthermore, if the deviation between the current line 
and the original pitch contour is smaller than the deviation with any one of the other lines 
accepted during this iteration step, the current line is selected as the best line found so far. 
If at least one of the lines tried out is accepted, the iteration is continued by repeating the 
process after taking one more pitch value into the segment. If none of the alternatives is 
acceptable, the optimization process is terminated and the best end points found during the 
optimization are selected as points of the piece-wise linear pitch contour. 

In the case of other segments, only the location of the end point can be optimized. 
The process is started by selecting the first pitch value after the fixed starting point as the 
best end point for the line found so far. Then, the iteration is started by taking one more 
pitch value into consideration. The candidates for the end point for the line are the 
quantized pitch values that are close enough to the original pitch value at that location 
such that the criterion for the desired accuracy is satisfied. After finding the candidates, 
all of them are tried out as the end point. The accuracy of linear representation is 
measured at each original pitch location and the candidate line can be accepted as a part of 
the piece-wise linear contour if the accuracy criterion is satisfied at all of these locations. 
In addition, if the deviation from the original pitch contour is smaller than with the other 
lines tried out during this iteration step, the end point candidate is selected as the best end 
point found so far. If at least one of the lines tried out is accepted, the iteration is 
continued by repeating the process after taking one more pitch value into the segment. If 
none of the alternatives is acceptable, the optimization process is terminated and the best 
end point found during the optimization is selected as a point of the piece-wise linear pitch 
contour. 

In both cases described above in detail, the iteration can be finished prematurely 
for two reasons. First, the process is terminated if no more successive pitch values are 
available. This may happen if the whole lookahead has been used, if the speech encoding 
has ended, or if the pitch transmission has been paused during inactive or unvoiced 
speech. Second, it is possible to limit the maximum length of a single linear part in order 
to code the point locations more efficiently. For both cases, these issues can be taken into 
account by setting a limit $i_{\max}$ to the iteration number $i$ based on the number of pitch values 
available and on the maximum time-distance between the ends of the line. The iteration is 
shown in Figure 4. 
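The segment-by-segment search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the coder's implementation: the pitch contour is a plain list with one value per frame, the deviation is a plain absolute difference (the coder described here measures distortion in the logarithmic domain), pauses in the pitch transmission are not handled, and the names `encode_contour`, `fit_segment`, `best_line`, `candidates`, `nearest`, `levels` and `max_dev` are hypothetical.

```python
from itertools import product


def line_value(t0, q0, t1, q1, t):
    """Value at index t of the straight line through (t0, q0) and (t1, q1)."""
    return q0 + (q1 - q0) * (t - t0) / (t1 - t0)


def nearest(levels, x):
    """Quantizer level closest to x."""
    return min(levels, key=lambda q: abs(q - x))


def candidates(levels, target, max_dev):
    """Quantized values close enough to the original pitch value."""
    return [q for q in levels if abs(q - target) <= max_dev]


def best_line(pitch, start, end, starts, ends, max_dev):
    """Try every start/end combination; return (total_dev, q0, q1) or None."""
    best = None
    for q0, q1 in product(starts, ends):
        devs = [abs(line_value(start, q0, end, q1, t) - pitch[t])
                for t in range(start, end + 1)]
        if max(devs) <= max_dev:          # accuracy criterion at every location
            total = sum(devs)
            if best is None or total < best[0]:
                best = (total, q0, q1)    # best line of this iteration step
    return best


def fit_segment(pitch, levels, start, max_dev, i_max, fixed_start=None):
    """Grow one segment from `start`; return (q_start, end_index, q_end)."""
    limit = min(i_max, len(pitch) - 1 - start)
    if fixed_start is None:               # first segment: both ends are free
        starts = candidates(levels, pitch[start], max_dev)
        q_start = nearest(levels, pitch[start])
    else:                                 # later segments: start already fixed
        starts = [fixed_start]
        q_start = fixed_start
    best = (q_start, start + 1, nearest(levels, pitch[start + 1]))
    for i in range(2, limit + 1):
        end = start + i
        ends = candidates(levels, pitch[end], max_dev)
        found = best_line(pitch, start, end, starts, ends, max_dev)
        if found is None:                 # no acceptable line: stop extending
            break
        best = (found[1], end, found[2])
    return best


def encode_contour(pitch, levels, max_dev, i_max):
    """Return the breakpoints (index, quantized pitch) of the contour."""
    points, start, fixed = [], 0, None
    while start < len(pitch) - 1:
        q0, end, q1 = fit_segment(pitch, levels, start, max_dev, i_max, fixed)
        if fixed is None:
            points.append((start, q0))
        points.append((end, q1))
        start, fixed = end, q1
    return points


pitch = [50, 52, 54, 56, 58, 58, 58, 58]
print(encode_contour(pitch, list(range(40, 71)), max_dev=0.5, i_max=64))
# prints [(0, 50), (4, 58), (7, 58)]
```

In this toy input, eight pitch values collapse to three breakpoints: a rising line over the first five values and a flat line over the rest. Lowering `i_max` forces shorter segments and therefore more breakpoints.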

After finding a new point of the piece-wise linear pitch contour, the point can be 
coded into the bitstream. Two values must be given for each point: the pitch value at that 
point and the time-distance between the new point and the previous point of the contour. 
Naturally, the time-distance does not have to be coded for the first point of the contour. 
The pitch value can be conveniently coded using a scalar quantizer. In the implementation 
used in the coder designed for storage of audio menus, each time distance value is coded 
using $\lceil \log_2(i_{\max}) \rceil$ bits. If desired, it is also possible to use some lossless coding, such as 
Huffman coding, on the time distance values. The pitch values are coded using scalar 
quantization. The scalar quantizer contains 32 levels (5 bits) obtained using 

[equation for $p(n)$ illegible in the source; only a floor term with denominator 8000 survives] 

where $n$ runs from 2 to 32 and $p(1) = 19$ samples. Thus, more distortion is allowed for low 
pitch frequencies, to take into account the properties of human hearing. Moreover, the 
known features of the human auditory system are exploited by performing the distortion 
measurements during the pitch quantization in the logarithmic domain. 
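To illustrate what measuring distortion in the logarithmic domain changes, the sketch below picks the nearest quantizer level under a log-domain distance and, for comparison, under a linear distance. The level table built by `make_levels` is a hypothetical geometric table that only borrows the 32-level size and the 19-sample starting value from the text; it is not the recursion used in the coder. All function names are illustrative assumptions.

```python
import math


def make_levels(first=19.0, count=32, ratio=1.07):
    """Hypothetical level table with geometric spacing (an assumption)."""
    return [first * ratio ** n for n in range(count)]


def quantize_log(pitch, levels):
    """Index of the level nearest to `pitch`, distance taken in the log domain."""
    return min(range(len(levels)),
               key=lambda k: abs(math.log(levels[k]) - math.log(pitch)))


def quantize_linear(pitch, levels):
    """Index of the level nearest to `pitch`, distance taken linearly."""
    return min(range(len(levels)), key=lambda k: abs(levels[k] - pitch))


# With only two levels, the two distance measures can disagree:
# 29 is linearly closer to 20, but closer to 40 as a ratio.
print(quantize_linear(29.0, [20.0, 40.0]), quantize_log(29.0, [20.0, 40.0]))
# prints 0 1

levels = make_levels()
index = quantize_log(60.0, levels)
print(index, round(levels[index], 2))
```

A 32-entry table needs `math.ceil(math.log2(32))`, that is 5, bits per index, which matches the 5-bit figure given above.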

An example of the piece-wise pitch contour, according to the present invention, 
along with the original pitch contour is shown in Figure 2. As shown in Figure 2, each 
linear segment is a straight line joining two points: a starting point and an end point. For 
example, the second line segment of the piece-wise pitch contour shown in Figure 2 is the 
straight line joining a point at t = 1.22 s and a point at t = 1.29 s. The number of pitch values 
in the time period from t = 1.22 s to t = 1.29 s is 8, including the starting point and the end 
point. 
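A rough bit count for such a segment can be worked out as below. This is only a back-of-the-envelope sketch: it assumes the 5-bit pitch quantizer and the ceil(log2(i_max)) time-distance code described for the storage coder, and it assumes a limit of 64 for i_max, which is an example value rather than a figure tied to Figure 2. The helper names are hypothetical.

```python
import math


def time_distance_bits(i_max):
    """Bits per time-distance value: ceil(log2(i_max))."""
    return math.ceil(math.log2(i_max))


def point_bits(pitch_bits, i_max):
    """Bits for one new contour point: quantized pitch plus time distance."""
    return pitch_bits + time_distance_bits(i_max)


new_values = 8 - 1                      # the start point belongs to the previous segment
fixed_rate = new_values * 5             # every pitch value sent with 5 bits
piece_wise = point_bits(5, i_max=64)    # one end point: 5 + 6 bits
print(fixed_rate, piece_wise)           # prints 35 11
```

Under these assumptions the segment costs 11 bits instead of 35, because only its end point is transmitted.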

In order to carry out the present invention, the speech coding system has an 
additional module for piece-wise pitch contour generation. As shown in Figure 3, the 
speech coding system 1 comprises an encoding module 10, which has a parametric speech 
coder 12 for processing the input speech signal in a plurality of segments. For each 
segment, the coder 12 determines a parametric representation 112 of the input signal. The 
parameters can be quantized or unquantized versions of the original parameters, depending 
on the speech coding system. A compression module 20, responsive to the parametric 
representation, reduces the pitch contour into a piece-wise pitch contour using, e.g., a 
software program 22. The points on the piece-wise contour are then coded by a 
quantization module 24 into the bitstream 120, which is transmitted through a 
communication channel or stored in a storage medium 30. At the receiver end, a decoder 40 
is used to generate a synthesized speech signal 140 based on the information in the 
received bitstream 130 indicative of the piece-wise pitch contour and other speech parameters. 
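On the decoder side, recovering a per-frame pitch contour from the transmitted points amounts to linear interpolation between consecutive points. The sketch below assumes the piece-wise linear case, a contour that starts at frame index 0, and a bitstream already parsed into a first pitch value followed by (time-distance, pitch) pairs; `points_from_stream` and `reconstruct_contour` are hypothetical names.

```python
def points_from_stream(first_pitch, pairs):
    """Turn (time_distance, pitch) pairs into absolute (index, pitch) points."""
    points = [(0, first_pitch)]
    for distance, pitch in pairs:
        points.append((points[-1][0] + distance, pitch))
    return points


def reconstruct_contour(points):
    """Interpolate linearly between breakpoints; one pitch value per frame."""
    contour = [float(points[0][1])]
    for (t0, q0), (t1, q1) in zip(points, points[1:]):
        for t in range(t0 + 1, t1 + 1):
            contour.append(q0 + (q1 - q0) * (t - t0) / (t1 - t0))
    return contour


points = points_from_stream(50, [(4, 58), (3, 58)])
print(points)                       # prints [(0, 50), (4, 58), (7, 58)]
print(reconstruct_contour(points))
# prints [50.0, 52.0, 54.0, 56.0, 58.0, 58.0, 58.0, 58.0]
```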



The software program 22 in the piece-wise pitch contour generation module 20 
contains machine readable codes that process the pitch values in the pitch contour 
according to the flowchart 500 as shown in Figure 4. The flowchart 500 shows the 
iteration for selecting a straight line representing a linear segment of the piece-wise pitch 
contour (see Figure 2). Each straight line has a starting point $Q(p_0)$ and an end point $Q(p_1)$. 
For the first linear segment, both the starting point $Q(p_0)$ and the end point $Q(p_1)$ have to 
be selected. For all other linear segments, only the end point $Q(p_1)$ has to be selected. The 
iteration starts by selecting a linear segment covering a time period that includes three pitch 
values. Thus, if the starting point is located at a first point in time and the end point is 
located at a second point in time, then there are three pitch values in the time period from 
the first point in time to the second point in time. Thus, $i = 2$ is set at step 502. At step 504, 
the end point is selected to be a point near or on the pitch value at the second point in time. 
For the first linear segment, the starting point is selected to be a point near or on the pitch 
value at the first point in time. At step 506, the deviation between each of the pitch values 
in the time period from the first point in time to the second point in time and the straight 
line joining the starting point and the end point is measured. Alternatively, the 
deviation can be measured at certain intervals. At step 508, the deviation is compared 
with a predetermined error value in order to determine whether the current straight line is 
acceptable as a candidate. If the deviation at some pitch values within the time period 
exceeds the predetermined error value, the end point (along with the starting point if the 
linear segment is the first segment) is adjusted and the iteration process loops back to step 
506 until no adjustment is possible. If the current straight line is acceptable as determined 
at step 508, it is compared to the earlier results at step 510 in order to determine whether it 
is the best straight line so far. The best straight line so far is the one with the smallest sum 
of the absolute deviations among the straight lines with the same $i$ obtained so far. 
The best line so far is stored at step 512. The end point is again adjusted at step 520 until 
no adjustment is possible. 

When adjustment is no longer possible, as determined at step 520, it is time to 
determine whether to stop the iteration process and use the best line stored at step 512 as 
the current line segment, or to extend the line segment further by increasing $i$ by 1 at step 
526 (unless the current $i$ is already equal to $i_{\max}$ as determined at step 524). It is possible 
that, after increasing $i$ by 1, no extended line is acceptable as determined at step 522. In 
that case, the best line with the previous $i$ is used as the straight line for the current segment. 
The number of candidates can be limited, e.g., by setting a maximum limit for how much 
the endpoint can differ from the sample value. The intervals between different endpoint 
candidates can also be set to limit the number of possible candidates. 
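The two ways of limiting the candidate set mentioned above can be sketched as one small helper. It assumes the limit is expressed in quantization steps around the level nearest to the sample value, and that the interval between candidates is a stride in level indices; both conventions, and the name `endpoint_candidates`, are illustrative assumptions.

```python
def endpoint_candidates(levels, target, max_steps, stride=1):
    """Indices of the quantizer levels to try as an end point.

    max_steps: how many quantization steps a candidate may lie from the
               level nearest to `target` (the original pitch value).
    stride:    spacing between successive candidates, in steps.
    """
    centre = min(range(len(levels)), key=lambda k: abs(levels[k] - target))
    lo = max(0, centre - max_steps)
    hi = min(len(levels) - 1, centre + max_steps)
    return [k for k in range(lo, hi + 1) if (k - centre) % stride == 0]


levels = list(range(20, 52))                        # 32 toy levels
print(endpoint_candidates(levels, 30.2, 2))         # prints [8, 9, 10, 11, 12]
print(endpoint_candidates(levels, 30.2, 2, 2))      # prints [8, 10, 12]
```

A smaller `max_steps` or a larger `stride` reduces the number of lines that have to be tested in each iteration, at the cost of possibly missing a slightly better end point.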

It should be noted that, in the piece-wise pitch contour of Figure 2, the third linear 
segment covers only two pitch values at t = 1.29 s and t = 1.30 s. That is because t = 1.30 s is 
the point in time separating two speech signal segments. 

It should also be noted that the adjustment of the end point or the starting point can 
only be carried out in steps. For example, the adjustment of $Q(p_1)$ can be carried out by 
increasing or decreasing the value of $Q(p_1)$ by one quantization step. However, the 
adjustment can also be carried out in smaller or larger steps. Furthermore, the limit of the 
longest line, or $i_{\max}$, can be set at a large number, such as 64. In that case, the time period 
(and, therefore, $i$) between the starting point and the end point varies significantly. For 
example, $i$ in the fourth line segment is equal to 5, while $i$ in the fifth line segment is 23. 
However, if $i_{\max}$ is set to 5, for example, then the time period (and $i$) in most or all linear 
segments is the same. Thus, this invention is applicable when $i$ is variable and $i_{\max}$ is 
variable or a fixed number. Also, the measured deviation between a segment candidate 
and the pitch values that is used to select the best candidate so far at step 510 can be the 
sum of absolute differences or other deviation measures. The generation of segment 
candidates may be limited by certain criteria, such as a pre-determined maximum absolute 
difference between each pitch value and the corresponding point in the segment candidate. 
For example, the maximum difference can be five or ten quantization steps, but it can be a 
smaller or a larger number. 

Furthermore, the present invention as described above can be modified without 
departing from the basic concept of modified pitch contour quantization. First, different 
optimization techniques can be used. Second, the modified pitch contour does not have to 
be piece-wise linear as long as the number of pitch values to be transmitted can be kept 
low. Third, the quantization techniques used for coding the pitch values and the time 
distances can be modified. Fourth, it is possible to construct the alternative pitch contour 
already during pitch estimation. 

Moreover, the embodiment described above is not by any means the only 
implementation alternative. For example, the optimization technique used in determining 
the new pitch contour can be freely selected. In addition, the new pitch contour does not 
have to be piece-wise linear. For example, it is possible to describe the contour using 
splines, polynomials, discrete cosine transform, etc. For example, a non-linear contour can 
have the following general form: 

$$Q(p) = Q(p_0) + a_1 \left[\frac{Q(p_1) - Q(p_0)}{t_1 - t_0}\right](t - t_0) + a_2 \left[\frac{Q(p_1) - Q(p_0)}{t_1 - t_0}\right]^2 (t - t_0)^2 + \cdots, \qquad t_1 > t > t_0$$

In this case, while the end points are updated as needed, it is sufficient to provide the 
algorithm to the decoder only once. 
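As an illustration of the general form above, the following sketch evaluates such a contour between two end points for a given list of coefficients. It assumes the series is truncated after the supplied coefficients; the function name `nonlinear_contour` and the argument layout are hypothetical. With the single coefficient 1.0 it reduces to the straight line used in the piece-wise linear case.

```python
def nonlinear_contour(t, t0, q0, t1, q1, coeffs):
    """Evaluate the contour at time t, with t0 <= t <= t1.

    q0, q1 : quantized pitch values at the end points t0 and t1
    coeffs : [a1, a2, ...]; term k is a_k * (slope * (t - t0)) ** k
    """
    slope = (q1 - q0) / (t1 - t0)
    return q0 + sum(a * (slope * (t - t0)) ** k
                    for k, a in enumerate(coeffs, start=1))


# Linear special case: halfway between 50 and 58.
print(nonlinear_contour(2, 0, 50, 4, 58, [1.0]))   # prints 54.0

# Adding a second-order term bends the contour upward.
print(nonlinear_contour(2, 0, 50, 4, 58, [1.0, 0.1]))
```

Only the end points change from segment to segment, so the coefficients can be shared with the decoder once, in line with the remark above.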

General Discussion 

The search for the optimal simplified model of the pitch contour can be formulated 
as a mathematical optimization problem. Let $f(t)$ denote the function that describes the 
original pitch contour in the range from 0 to $t_{\max}$. Furthermore, let $g(t)$ denote the 
simplified pitch contour and $d(f(t), g(t))$ denote the deviation between the two contours at 
time instant $t$. Now, the optimization problem to be solved is to find the simplified pitch 
contour $g(t)$ that satisfies two optimality conditions: 

(I) The number of bits needed for describing the contour $g(t)$ is minimized. 
(II) $d(f(t), g(t)) \le h(t)$ for all $0 \le t \le t_{\max}$, 
where $h(\cdot)$ defines the maximum allowable deviation from the original pitch contour. 
From the set of contours that satisfy both conditions, the contour function that minimizes 
the total deviation, 

$$D = \int_0^{t_{\max}} d(f(t), g(t))\,dt, \qquad (1)$$

is selected as the final simplified contour. 

In general, the above optimization problem is unsolvable. However, the problem
can be solved if its generality is reduced by fixing the pitch contour model. For example,
in a piece-wise linear model, the function g(t) can be described using the points in which
the derivative of g(t) changes. Let q_n and t_n denote the coordinates of the nth such point
(1 ≤ n ≤ N, where N is the number of these points in the piece-wise linear model). The
simplified contour can be defined in N - 1 linear pieces as

$$g(t) = q_n + \frac{q_{n+1} - q_n}{t_{n+1} - t_n}\,(t - t_n), \qquad t_n \le t \le t_{n+1},$$ (2)

where 1 ≤ n ≤ N - 1. To make the definition complete, it is required that t_n < t_{n+1}, and that
t_1 = 0 and t_N = t_max. In addition, it is required that all values of q_n are within the finite
range from q_min to q_max. With this model, the optimization problem reduces to the search
for the set of points (t_n, q_n) that describes the contour g(t) that satisfies the conditions (I)
and (II) and minimizes the total deviation in Eq. 1. Now, by making the reasonable
assumption that the point coordinates can only be represented with a limited resolution,
the problem becomes solvable since the points are located in a grid with a finite number of
possible point locations. This assumption does not reduce the generality of the formulation
since the finite accuracy follows directly from the optimality condition (I).
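The piece-wise linear model can be illustrated with a short Python sketch. It assumes straight-line interpolation between consecutive points (t_n, q_n); the function name `piecewise_linear` and the example points are hypothetical and are not taken from the described coder.

```python
from bisect import bisect_right


def piecewise_linear(points, t):
    """Evaluate g(t) for breakpoints [(t_1, q_1), ..., (t_N, q_N)].

    The time coordinates must be strictly increasing (t_n < t_{n+1}).
    """
    times = [tn for tn, _ in points]
    if not times[0] <= t <= times[-1]:
        raise ValueError("t is outside the contour range")
    # Index of the linear piece containing t; the last piece also owns t_N.
    n = min(bisect_right(times, t) - 1, len(points) - 2)
    (t0, q0), (t1, q1) = points[n], points[n + 1]
    return q0 + (q1 - q0) * (t - t0) / (t1 - t0)


# Hypothetical contour with N = 3 points, i.e. two linear pieces.
example_points = [(0, 40.0), (30, 46.0), (50, 42.0)]
midpoint_value = piecewise_linear(example_points, 15)
```

Only the N points need to be stored or transmitted; every intermediate value of g(t) follows from them.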

Solutions for the problem 

The optimization problem formulated in the last section can be solved in many
ways. Here, two solutions are described. The first one is computationally burdensome but
is always capable of finding the global optimum whereas the second solution is very
simple but produces only sub-optimal results. In both solutions, we assume that the pitch
values q_n are coded into bits using a scalar quantizer with a codebook C = {c_1, c_2, ..., c_M},
and that the time indices t_n are integer multiples of some time unit T. Furthermore, we
assume that both C and T are selected in such a manner that a solution exists, and make the
reasonable additional assumption that the number of bits needed for describing the contour
can be minimized by minimizing N (the number of points needed for defining the
simplified contour).

Globally optimal approach 

The globally optimal solution can be achieved using the following straightforward 
brute force algorithm: 

Step 1. Initialization. Set N = 1.

Step 2. Set N = N + 1. Can we find a suitable piece-wise linear model with the current N?
If yes, then go to Step 3. Otherwise, repeat Step 2.

Step 3. Exit and code the simplified contour. If there are several suitable contour
candidates, select the one that minimizes the total deviation in Eq. 1.

The test in Step 2 can be performed by checking all suitable piece-wise linear
contour candidates (with the current N) against the optimality condition (II). During the
first iteration (N = 2), the candidates are all the lines with the endpoints (t_1, q_1) and (t_2, q_2)
that satisfy the condition

$$d(f(t_n), q_n) \le h(f(t_n)).$$ (3)

In this case, the time indices are fixed to t_1 = 0 and t_2 = t_max. The values of q_1 and q_2 are
selected from the codebook C, and thus there is only a limited number of candidates.
During the second iteration (N = 3), the contour candidates have two (N - 1) linear pieces.
This time the first and the last time indices (t_1 and t_3) are fixed to 0 and t_max whereas the
time index t_2 can be adjusted in the range from T to t_max - T with steps of T. Again, the
values of q_n are selected from the codebook C. Similarly, with some arbitrary N the
simplified contour consists of N - 1 linear pieces and N - 2 of the time indices can be
adjusted.
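The brute force search of Steps 1 to 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the coder: the pitch contour is taken as a list of samples at unit spacing (T = 1), the deviation is the absolute error, and the names `deviation` and `brute_force_contour` as well as the test data are hypothetical.

```python
from itertools import combinations, product


def deviation(pitch, times, values, h):
    """Total absolute deviation of the candidate, or None if any sample
    violates the tolerance h (condition II)."""
    total = 0.0
    for k0, q0, k1, q1 in zip(times, values, times[1:], values[1:]):
        # Skip k0 on later pieces: it was scored as the previous end point.
        for k in range(k0 if k0 == 0 else k0 + 1, k1 + 1):
            g = q0 + (q1 - q0) * (k - k0) / (k1 - k0)
            err = abs(pitch[k] - g)
            if err > h(pitch[k]):
                return None
            total += err
    return total


def brute_force_contour(pitch, codebook, h):
    """Return (breakpoints, total_deviation) using the fewest breakpoints,
    or None if no candidate is acceptable."""
    last = len(pitch) - 1
    # Codebook entries close enough to the original value at each instant.
    allowed = [[c for c in codebook if abs(p - c) <= h(p)] for p in pitch]
    for n_points in range(2, len(pitch) + 1):          # Step 2: N = N + 1
        best = None
        for interior in combinations(range(1, last), n_points - 2):
            times = (0, *interior, last)
            for values in product(*(allowed[k] for k in times)):
                dev = deviation(pitch, times, values, h)
                if dev is not None and (best is None or dev < best[1]):
                    best = (list(zip(times, values)), dev)
        if best is not None:                            # Step 3: exit
            return best
    return None


# Hypothetical triangular contour, integer codebook, tolerance of one sample.
triangle = [40, 42, 44, 46, 44, 42, 40]
result = brute_force_contour(triangle, list(range(30, 60)), lambda p: 1)
```

Even on this seven-sample example the search enumerates every combination of interior time indices and codebook values, which is why the approach does not scale.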

It is easy to see that the above algorithm always finds the optimal contour
candidate since the check in Step 2 takes care of the condition (II), the iterative process
guarantees that the condition (I) is satisfied, and the total deviation is minimized in Step 3.
However, it is also easy to see that the complexity of this algorithm grows extremely fast
with increasing problem size. More precisely, we can state that in the worst case the
algorithm goes through

$$\sum_{N=2}^{m+2} \binom{m}{N-2}\, b^{N} = b^{2}(b+1)^{m}$$ (4)

different contour candidates. In the above equation, b denotes the maximum number of
codebook entries that can satisfy the condition of Eq. 3 and m = (t_max / T) - 1.

In a practical situation, these variables could be, for example, b = 3 and m = 62,
leading to about 1.9 × 10^38 contour candidates in the worst case. Consequently, it can be
concluded that this theoretically optimal approach can only be used when b and m are
small (for example, when b = 3 and m = 8, the worst-case number of candidates is 589824)
and thus this approach is not suitable for most practical implementations.
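The two worst-case figures quoted above can be checked numerically. The sketch below assumes that each of the N points can take up to b codebook values and that the N - 2 adjustable time indices are chosen among m positions, which gives the closed form b^2 (b + 1)^m. This counting model is an assumption inferred from the quoted numbers, and the function names are hypothetical.

```python
from math import comb


def worst_case_candidates(b, m):
    """Closed form of the assumed worst-case count: b^2 * (b + 1)^m."""
    return b**2 * (b + 1) ** m


def worst_case_by_summation(b, m):
    """Same count, summed over the number of points N = 2 .. m + 2."""
    return sum(comb(m, n - 2) * b**n for n in range(2, m + 3))


small_case = worst_case_candidates(3, 8)    # b = 3, m = 8
large_case = worst_case_candidates(3, 62)   # b = 3, m = 62
```

With b = 3 and m = 8 the count matches the quoted 589824, and with m = 62 it is on the order of 1.9 × 10^38.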

Simple sub-optimal approach 
As demonstrated earlier, the optimization process may require large amounts of
computation if the target is to always find the globally optimal piece-wise linear contour.
However, quite good results can be achieved with the very simple and computationally
efficient technique (in which the complexity grows only linearly with increasing problem
size) described in this section. In addition to its simplicity, one advantage of this approach
is that the whole pitch contour is not processed at once but instead only a relatively small
look-ahead is required.

The main idea in the simplified approach is to go through the optimization process
one linear piece at a time. For each linear piece, the maximum length line that can keep
the deviation from the true contour low enough is searched without using knowledge of
the contour outside the boundaries of the linear piece. Within this optimization technique,
there are two cases that have to be considered separately: the first linear piece and the
other linear pieces. The case of the first linear piece occurs at the beginning when the
encoding process is started. In addition, if no pitch values are transmitted for inactive or
unvoiced speech, the first linear pieces after these pauses in the pitch transmission fall into
this category. In both situations concerning the first linear piece, both ends of the line are
optimized. Other cases fall into the second category in which the starting point for the
line has already been fixed in the optimization of the previous linear piece and thus only
the location of the end point is optimized.

In the case of the first linear piece, the process starts by selecting the quantized
pitch values at the time indices 0 and T as the best end points for the line found so far.
Then, the actual iteration begins by considering the cases where the ends of the line are
close enough to the original pitch values at time indices 0 and 2T. In other words, the
candidates for the start point are all the quantized pitch values that are close enough to the
original pitch value at t_1 = 0 such that the criterion for the desired accuracy (given in Eq.
3) is satisfied. Similarly, the candidates for the end point are the quantized pitch values
that are close enough to the original pitch value at t_2 = 2T. After the candidates have been
found, all the possible start point and end point combinations are tried out: the accuracy of
the linear representation is measured in the time interval between t_1 and t_2, and the
candidate line can be accepted as a part of the piece-wise linear contour if the accuracy
criterion is satisfied. Furthermore, if the deviation from the original pitch contour is
smaller than with the other lines accepted during this iteration step, the line is selected as
the best line found so far. If at least one of the candidates is accepted, the iteration is
continued by repeating the process after increasing t_2 by a step of size T. If none of the lines
is accepted, the optimization process is terminated and the best end points found during
the previous iteration are selected as the first points of the piece-wise linear pitch contour.

In the case of other linear pieces, only the location of the end point can be
optimized since the start point has already been fixed during the optimization of the
previous linear piece. The process is started by selecting the quantized pitch value located
an interval of T after the fixed starting point as the best end point for the line found so far.
(Let (t_{n-1}, q_{n-1}) and (t_n, q_n) denote the fixed start point and the end point to be optimized,
respectively.) Then, the iteration is started by taking one more time step into
consideration, i.e. t_n = t_{n-1} + 2T. The candidates for the end point of the line are the
quantized pitch values that are close enough to the original pitch value at the new t_n such
that the criterion for the desired accuracy is satisfied. After finding the candidates, the rest
of the process is similar to the case of the first linear piece.

In both cases described above in detail, the iteration can be finished prematurely
for two reasons. First, the process is terminated if t_n cannot be increased because the
original pitch contour ends before t_n + T. This may happen if the whole look-ahead buffer
has been used, if the speech signal to be encoded has ended, or if the pitch transmission
has been paused during inactive or unvoiced speech. Second, it is possible to limit the
maximum length of a single linear part in order to code the time indices of the points more
efficiently. For both cases, these issues can be taken into account by setting a limit t_nmax
based on the duration of the available pitch contour and on the maximum time-distance
between the ends of the line. This approach is illustrated in flowchart 600 in Figure 5,
which shows the optimization process for one linear piece.

The flowchart 600 shows the iteration for selecting a straight line representing one
linear segment of the piece-wise pitch contour. The straight line has a starting point
Q(f(t_{n-1})) and an end point Q(f(t_n)). For the first linear segment, both the starting point
Q(f(t_{n-1})) and the end point Q(f(t_n)) have to be selected. For all other linear segments, only
the end point Q(f(t_n)) has to be selected. The iteration starts by selecting a linear segment
with t_n = t_{n-1} + T. The starting point Q(f(t_{n-1})) and the end point Q(f(t_n)) are considered
as the best end points so far. Then, at step 602, t_n is set to t_n + T. At step 604, the end point is
selected to be a point near f(t_n). For the first linear segment, the starting point is near
f(t_{n-1}). For all other segments, the starting point is fixed. At step 606, the deviation between
the candidate line and each of the pitch values in the time period from t_{n-1} to t_n is
measured. At step 608, the deviation is compared with a predetermined error value in
order to determine whether the current straight line is acceptable as a candidate. If the
deviation at some pitch values within the time period exceeds the predetermined error
value, the end point (along with the starting point if the linear segment is the first segment)
is adjusted and the iteration process loops back to step 606 until no adjustment is possible.
If the current straight line is acceptable as determined at step 608, it is compared to the
earlier results at step 610 in order to determine whether it is the best straight line so far.
The best straight line so far is the one with the smallest sum of the absolute deviations
among the straight lines with the same t_n obtained so far. The best line so far is
stored at step 612. The end point is again adjusted at step 620 until no adjustment is
possible.

When adjustment is no longer possible, as determined at step 620, it is time to
determine whether to stop the iteration process and use the best line stored at step 612 as
the current line segment, or to extend the line segment further by increasing t_n by T at step
626 (unless the current t_n is already equal to t_max as determined at step 624). It is possible
that, after increasing t_n by T, no extended line is acceptable as determined at step 622. In
that case, the best line with the previous t_n is used as the straight line for the current segment.
The number of candidates can be limited, e.g., by setting a maximum limit for how much
the endpoint can differ from the sample value. The intervals between different endpoint
candidates can also be set to limit the number of possible candidates.


Practical implementation 

The pitch contour quantization technique introduced in this paper is included in a
practical speech coder designed for storage applications. The coder operates at very low
bit rates (about 1 kbps) and processes the 8 kHz input speech in segments of variable
duration (between 20 and 640 ms). In the practical implementation, the simple sub-optimal
approach is used and only the pitch contour located in the current segment is
considered in the optimization. During unvoiced or inactive segments, no pitch
information is coded. The variable T is set to 10 ms, which is equal to the pitch estimation
interval. Furthermore, the continuous pitch contour is approximated using the discrete
contour formed by the estimated pitch values p_k (at 10 ms intervals). Consequently, the
optimality condition (II) is changed into

$$d(p_k, g(kT)) \le h(p_k) \quad \text{for all } 0 \le k \le t_{\max}/T.$$ (5)

In addition, the minimization of the total distortion in Eq. 1 is approximated with the
minimization of

$$D = \sum_{k=0}^{t_{\max}/T} d(p_k, g(kT)),$$ (6)

where the function d is defined as the absolute error, i.e. d(x, y) = |x - y|.
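The discrete acceptance test and the total distortion with the absolute error can be written compactly. The sketch below assumes the simplified contour has already been sampled at the pitch estimation instants; the function names and the constant tolerances in the usage lines are hypothetical.

```python
def within_tolerance(pitch, contour, h):
    """Discrete form of condition (II): every |p_k - g(kT)| is at most h(p_k)."""
    return all(abs(p - g) <= h(p) for p, g in zip(pitch, contour))


def total_distortion(pitch, contour):
    """Discrete total deviation: the sum of absolute errors over all instants."""
    return sum(abs(p - g) for p, g in zip(pitch, contour))


estimated = [40.0, 42.0, 44.0]     # p_k, hypothetical values in samples
simplified = [40.0, 43.0, 44.0]    # g(kT) sampled at the same instants
accepted = within_tolerance(estimated, simplified, lambda p: 2.0)
distortion = total_distortion(estimated, simplified)
```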



The function h that defines the maximum allowable coding error for a given pitch
value is determined as

$$h(p_k) = \max(2,\; 480\, p_k / 8000).$$ (7)

The same function is also used in the generation of the codebook C used in scalar
quantization of the pitch values q_n. The entries of the 32-level (5-bit) codebook C are
computed using c_j = c_{j-1} + h(c_{j-1}) with c_1 = 19. This codebook covers the pitch period
range used in the coder and is quite consistent with the experimental findings. Moreover,
this codebook and function h approximately follow the theory of critical bands in the sense
that the frequency resolution of the human ear is assumed to decrease with increasing
frequency. To further enhance the perceptual performance, the quantization is done in
the logarithmic domain.

The time indices are coded for one segment at a time using differential
quantization, with the exception that the time-distance is not coded at all for the first point
of each segment since t_1 is always 0. In the differential coding scheme, a given time index
is coded using the time-distance between it and the previous time index in steps of size T.
More precisely, the value of a given t_n is coded by converting ((t_n - t_{n-1}) / T) - 1 into the
binary representation containing ⌈log2(l_max - 1)⌉ bits, where l_max denotes the maximum
length that would have been allowed for the current linear piece. One additional trick is
used in our implementation to increase coding efficiency: If the number of time indices to
be coded is more than half of the number of pitch estimation instants in the segment, the
"empty" time indices are coded instead of the time indices t_n (and one bit is used to
indicate which coding scheme is used). However, it should be noted that the efficiency of
this trick is enabled by the segmental processing used in the storage coder implementation.
In a general case with continuous frame-based processing, a better way would be to use
some lossless coding technique, such as Huffman coding, directly on the time distance
values.
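The codebook recursion and the differential time-distance values described above can be sketched as follows. The sketch assumes that Eq. 7 reads h(p) = max(2, 480 p / 8000) with p in samples at 8 kHz, and it leaves out the logarithmic-domain quantization, the bit allocation and the "empty" index scheme. The function names are hypothetical.

```python
def h(p):
    """Maximum allowable coding error for pitch period p (assumed reading of Eq. 7)."""
    return max(2.0, 480.0 * p / 8000.0)


def build_codebook(first=19.0, levels=32):
    """Codebook entries from c_j = c_{j-1} + h(c_{j-1}), starting at c_1 = 19."""
    codebook = [first]
    while len(codebook) < levels:
        codebook.append(codebook[-1] + h(codebook[-1]))
    return codebook


def time_distance_symbols(time_indices, step):
    """Values ((t_n - t_{n-1}) / T) - 1 for every point after the first.

    time_indices are integer multiples of step; the first one (t_1 = 0)
    is not coded.
    """
    return [(b - a) // step - 1 for a, b in zip(time_indices, time_indices[1:])]


codebook = build_codebook()
# Hypothetical breakpoints at 0, 30, 50 and 60 ms with T = 10 ms.
symbols = time_distance_symbols([0, 30, 50, 60], 10)
```

Under this reading the step between entries is 2 samples for short pitch periods and grows by about 6 percent per entry for longer ones, so the 32 entries are spaced more coarsely where a larger error is tolerated.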

The implementation described above is capable of coding the pitch contour at
an average bit rate of approximately 100 bps in such a manner that the deviation from the
original contour remains below the maximum allowable deviation defined in Eq. 7.
Despite the very low bit rate, the coded pitch contour is quite close to the original contour.
The average and the maximum absolute coding errors are about 1.16 and 5.12 samples,
respectively, at 99 bps. When judged by expert listeners, the coded contour could be
easily distinguished from the original contour but the coding error is not particularly
annoying. The pitch quantization technique has not been tested explicitly with naive
listeners; however, a formal listening test indicated that the storage coder containing the
proposed pitch quantization technique outperformed a 1.2 kbps state-of-the-art reference
coder by a wide margin despite the average bit rate reduction of more than 200 bps (for the
pitch alone, the reduction is about 70 bps).

In sum, the present invention exploits the fact that a typical pitch contour evolves
fairly smoothly but contains occasional rapid changes in order to construct a piece-wise
linear pitch contour that closely follows the shape of the original contour but contains less
information to be coded. For example, only the points of the piece-wise linear pitch
contour where the derivative changes are quantized. During unvoiced speech, a constant
default pitch value can be used both at the encoder and at the decoder. Furthermore, the
properties of human hearing are exploited by allowing larger deviations from the true pitch
contour in cases where the pitch frequency is low. The present invention offers a
substantial reduction in the bit rate required for perceptually sufficient quantization
accuracy: with the proposed quantization technique an accuracy level close to that of a
conventional pitch quantizer operating at 500 bps (5-bit quantizer, 100 pitch values per
second) can be reached at an average bit rate of about 100 bps. If lossless compression is
used to supplement the method described in this invention report, it is possible to
further reduce the bit rate to about 80 bps, for example.
The main utilities of the invention include:

- It is possible to use a significantly lower average update rate than with the prior-art
techniques.

- The piece-wise linear pitch contour can be reconstructed at the decoder in such a manner
that it is very close to the true pitch contour.

- The invention takes into account the fact that the human ear is more sensitive to pitch
changes when the pitch frequency is low.

- The technique enables considerable reductions in the bit rate.

- The invention can be implemented as an additional block that can be used with existing
speech coders.

The present invention is suitable for storage applications and it has been
successfully used in a speech coder designed for pre-recorded audio messages. In the
target application, the audio messages (audio menus) are recorded and encoded off-line on
a computer. The resulting low-rate bitstream can then be stored and decoded locally in a
mobile terminal. The low-rate bitstream can be provided by a component in a
communication network, as shown in Figure 6. Figure 6 is a schematic representation of a
communication network that can be used for coder implementation regarding storage of
pre-recorded audio menus and similar applications, according to the present invention. As
shown in the figure, the network comprises a plurality of base stations (BS) connected to a
switching sub-station (NSS), which may also be linked to other networks. The network
further comprises a plurality of mobile stations (MS) capable of communicating with the
base stations. The mobile station can be a mobile terminal, which is usually referred to as
a complete terminal. The mobile station can also be a module for a terminal without a
display, keyboard, battery, cover, etc. The mobile station may have a decoder 40 for
receiving a bitstream 120 from a compression module 20 (see Figure 3). The compression
module 20 can be located in the base station, the switching sub-station or in another
network.

Although the invention has been described with respect to a preferred embodiment
thereof, it will be understood by those skilled in the art that the foregoing and various
other changes, omissions and deviations in the form and detail thereof may be made
without departing from the scope of this invention.